Jose Medina - Milestone 1, 2, and Final Milestone

Note to grader: please scroll to end of notebook for final capstone section.

Context


Global warming threatens to irreversibly disrupt the fragile balance that exists in natural ecosystems which are vital to sustaining life on our planet.

Scientific evidence indicates that the primary contributor to global warming and rising temperatures is CO$_{2}$. This greenhouse gas contributes to the "Greenhouse Effect", whereby the sun's warmth is trapped in the lower atmosphere, leading to rising temperatures. The excessive levels of CO$_{2}$ that lead to the Greenhouse Effect are a consequence of human economic and industrial activity, which produces significant amounts of the gas.

The energy industry contributes a significant amount of CO$_{2}$ through the burning of fossil fuels to generate energy, though cleaner energy sources and more sustainable policy approaches exist.

With our lens focused on the energy industry, understanding the sources of CO$_{2}$ and being able to forecast the trajectory of CO$_{2}$ production can aid public policy and industrial decision-making on emissions reduction strategies, which range from cleaner energy sources to incentives for reductions in CO$_{2}$.

Objective

Jose Medina Key questions

Questions about the data in general:

  • What is the nature of our data? (data types, timeframes, etc.)
    • Do we have a comprehensive dataset?
    • For example, do we have any missing values?
      • If we are missing values, is the amount that is missing a sizeable portion such that it could affect our analysis?
      • In particular, if we're missing data, how much of the missing data affects the variable that we're particularly interested in forecasting (in this case, natural gas emissions)?

Questions about the trending:

  • How are the different fuel type emissions changing over time?
  • How do natural gas emissions trend in general, and how do they compare to other fuel types?
  • Do they appear to follow a pattern, such as trending up, trending down, or not trending at all? Do we notice any seasonality? The answers could be helpful in our eventual modelling/forecasting steps.

Jose Medina Problem Formulation:

Employing data science techniques, we wish to explore whether it is feasible to develop a useful model that predicts Natural Gas Emissions for the next 12 months.

To do this we'll need to explore modeling techniques, including ARIMA, and establishing approaches that:

  • Test for stationarity
  • If necessary, transform or manipulate the dataset appropriately to introduce stationarity
    • Techniques may include log transformation and/or differencing the series to account for any recurring patterns, which, in turn, will allow us to develop appropriate modeling techniques to produce an accurate forecast
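As a minimal, dependency-free sketch of the two transformations named above, the snippet below applies a natural-log transform and then differences a series. The `values` list is made-up illustrative data (not the EIA series) and the helper name `difference` is hypothetical; in the notebook itself the pandas equivalents would typically be used instead.

```python
import math


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Made-up values for illustration only (not EIA data).
values = [10.0, 12.0, 15.0, 11.0, 13.0, 16.5]

# Log transform: compresses swings that grow with the level of the series.
log_values = [math.log(v) for v in values]

# First difference (lag=1): removes a steady trend in the (log) level.
first_diff = difference(log_values, lag=1)

# Seasonal difference: removes a pattern that repeats every `lag` steps.
# lag=3 is used only because the toy series is short; monthly data
# with a yearly cycle would use lag=12.
seasonal_diff = difference(values, lag=3)
print(seasonal_diff)
```

Each differencing step shortens the series by `lag` observations, which is worth keeping in mind when a 12-month hold-out set is also being reserved.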

Lastly, we'll need to test the efficacy of our model by comparing its forecasts to a hold-out data set from the time series, which we can use to compute accuracy measures such as RMSE or MAE.
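For concreteness, here is a small sketch of the two accuracy measures mentioned above. The function names `rmse` and `mae` and the `holdout`/`forecast` numbers are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: large misses are penalized more heavily."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in the data's units."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up hold-out values and forecasts (illustration only).
holdout = [30.0, 32.0, 35.0, 31.0]
forecast = [28.0, 33.0, 35.0, 35.0]

print("RMSE:", rmse(holdout, forecast))
print("MAE:", mae(holdout, forecast))
```

Both measures are expressed in the same units as the series (here, million metric tons of CO2), so they can be read directly against typical monthly values.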

Assuming we are able to develop a useful model, we can then use the results to develop a set of recommendations for policy makers with respect to the trajectory we expect for natural gas emissions over the next 12 months.

Attributes Information:

This dataset contains historical monthly carbon dioxide emissions from electricity generation, published by the US Energy Information Administration and categorized by fuel type (coal, natural gas, etc.).

MSN:- Reference to Mnemonic Series Names (U.S. Energy Information Administration Nomenclature)

YYYYMM:- The year and month in which these emissions were observed

Value:- Amount of CO2 Emissions in Million Metric Tons of Carbon Dioxide

Description:- The category of electricity production (fuel type) from which the carbon is emitted.
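As a hedged sketch of how a YYYYMM code could be turned into a date, the helper below uses only the standard library. The name `parse_yyyymm` is hypothetical. It also assumes, as a precaution not stated in this notebook, that some rows might carry a month code outside 01 to 12 (for example an annual-total row), and returns `None` for those so they can be filtered out.

```python
from datetime import datetime


def parse_yyyymm(code):
    """Return the first day of the month for a valid YYYYMM code, else None."""
    try:
        return datetime.strptime(str(code), "%Y%m")
    except ValueError:
        # Assumption: codes that are not a real calendar month are skipped.
        return None


# Made-up codes for illustration only.
codes = ["197301", "197302", "197313", 201607]
parsed = [parse_yyyymm(c) for c in codes]

# Keep only rows that map to a real calendar month.
valid_dates = [d for d in parsed if d is not None]
print(valid_dates)
```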

Important Notes

Loading the libraries

Loading the data

The arguments can be explained as:

Medina Observations

  • We ran pandas profiling, which revealed the following:
  • We observe that there are 384 missing values for the "Value" column representing 8.2% of the values.

Jose Medina Comments:

We see from .info() that we're missing 384 values in the Value field.

  • That represents 384/4707 = 8.2% of the dataset.
  • 384 is equivalent to removing 1 year's worth of data from the data set, though it may be scattered across fuel types.
  • Before blindly dropping the data, we should see what proportion of values are missing with respect to Natural Gas Emissions - because that is the Fuel Type we're specifically interested in.

Good, we see that for Natural Gas in particular, we are not missing any values.
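The per-fuel-type check described above can be sketched as follows. This is a dependency-free illustration on made-up rows, not the notebook's actual code or data; the helper name `missing_share_by_group` is hypothetical, and `None` stands in for a missing Value.

```python
from collections import defaultdict

# Made-up (Description, Value) rows; None marks a missing value.
rows = [
    ("Coal", 72.1), ("Coal", None),
    ("Natural Gas", 12.2), ("Natural Gas", 14.0),
    ("Geothermal", None), ("Geothermal", None),
]


def missing_share_by_group(rows):
    """Return {group: fraction of rows in that group whose value is missing}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for group, value in rows:
        total[group] += 1
        if value is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


shares = missing_share_by_group(rows)
print(shares)
```

A share of 0.0 for the series being forecast means that dropping the missing rows does not remove any of its observations.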

Jose Medina - Custom Section

  • Running Pandas Profiling to develop a broader understanding, quickly, of:
    • Missing values
    • Distributions
    • Basic correlations

Dataset visualization

Visualize the dependency of the emission in the power generation with time.

Jose Medina - complementary analysis

  • It appears from the data that there may be some seasonality to the trending,
  • Let's explore this quickly by creating a seasonal feature that might explain the saw-tooth pattern that we're seeing across the different fuels. We'll isolate this to just Coal and Natural Gas for now.
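One possible way to build such a seasonal feature is sketched below. The month-to-season mapping (December to February as Winter, and so on) is an assumption made for illustration and may differ from the grouping used in the notebook; the observations are made up.

```python
from statistics import mean

# Assumed mapping from calendar month to season (illustrative choice).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Made-up (month, emissions) observations for one fuel type.
observations = [(1, 10.0), (4, 8.0), (7, 15.0), (8, 17.0), (10, 9.0), (12, 12.0)]

# Group the values by season, then summarize each group.
by_season = {}
for month, value in observations:
    by_season.setdefault(SEASON_BY_MONTH[month], []).append(value)

season_means = {season: mean(vals) for season, vals in by_season.items()}
print(season_means)
```

The grouped values in `by_season` are the same per-season samples a boxplot would draw, so the feature can feed either a summary table or a plot.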

Visualize the trend of CO2 emission from each energy source individually

Jose Medina Observations and Insights:

General Trending

  • Coal emissions exceed those of all other fuel types, followed by natural gas.
  • Though coal dominates as a source of emissions, it appears to be on a downward trend since ~2008. It may be worthwhile to explore what, if any, policy or economic changes might have prompted a drop on or about that time.
  • In contrast, we observe that Natural Gas has been steadily increasing in that same timeframe, with a marked increase starting on or about 2008.
  • Additionally, Coal, Petroleum, Petroleum Coke, Residual Fuel Oil, and Distillate Fuel have all declined in emissions.
  • Meanwhile, Geothermal and Natural Gas have both increased, while Non-Biomass Waste has held relatively flat over the past ~10 years.

Seasonal Trending

  • Boxplot analysis of the seasonality of Natural Gas and Coal indicates that the top emissions periods occur in the Summer for both.
  • For Natural Gas in particular, we see that the higher-emission seasons are Spring/Fall and Summer (in other words, hotter weather seems to drive more fuel use).

Bar chart of CO2 Emissions per energy source

For developing the time series model and forecasting, use the natural gas CO2 emissions from electrical power generation.

Jose Medina Comments

  • Let's test the stationarity of the series by exploring some rolling means and standard deviations
  • Let's first split out a training / test portion so we can validate the modelling later
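A rough sketch of both steps, using only the standard library: a chronological split that keeps the last observations as the hold-out set, followed by rolling statistics on the training portion. The helper names `train_test_split_ts` and `rolling` are hypothetical, and the toy series is made up; for the monthly data here, a 12-month hold-out and a 12-month window would be natural choices.

```python
from statistics import mean, stdev


def train_test_split_ts(series, test_size):
    """Chronological split: the last `test_size` points form the hold-out set."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


def rolling(series, window, func):
    """Apply `func` to each full trailing window of length `window`."""
    return [
        func(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]


# Made-up, steadily rising toy series: 1.0, 2.0, ..., 10.0.
series = [float(v) for v in range(1, 11)]

train, test = train_test_split_ts(series, test_size=2)
rolling_mean = rolling(train, window=3, func=mean)
rolling_std = rolling(train, window=3, func=stdev)

print(rolling_mean)  # rises over time: the mean is not constant
print(rolling_std)   # stays flat: the spread is roughly constant
```

The split is done before any statistics are computed so that nothing from the hold-out period leaks into the exploration or the model fitting.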

Jose Medina Observations & insights:

  • We can see there is an upward trend in the series in the moving average.
  • However, the standard deviation is almost constant, which implies that the series has roughly constant variance.
  • Because the mean changes over time, the series is not stationary.
  • We can also use the Augmented Dickey-Fuller (ADF) Test to verify if the series is stationary or not. The null and alternate hypotheses for the ADF Test are defined as:

Null hypothesis: The time series is non-stationary

Alternative hypothesis: The time series is stationary
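For reference, the regression that the ADF test is commonly built on can be written as follows, in standard textbook notation (the symbols are not taken from this notebook; the constant $\alpha$ and trend term $\beta t$ are optional, and the lag order $p$ is a modelling choice):

$$
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t
$$

In this notation the null hypothesis corresponds to $\gamma = 0$ (a unit root, so the series is non-stationary) and the alternative to $\gamma < 0$.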

Jose Medina Observations:

  • From the above test, we can see that the p-value = 1, i.e. > 0.05 (the 5% significance level); therefore, we fail to reject the null hypothesis.
  • Hence, we have no evidence of stationarity and treat the series as non-stationary.

Jose Medina Proposed approach

Potential techniques

  • We now know that the series is not stationary, so we'll need to work with the data to make it stationary
  • Techniques may include log transformation and/or differencing the series
  • In the next milestone, we should decompose the time series into Trend, Seasonality, and Residual in an attempt to make it stationary and predictable.
  • Plots such as the ACF and PACF will help us understand the autocorrelation and partial autocorrelation structure of the series.
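To make the autocorrelation idea concrete, here is a minimal sketch of the sample autocorrelation at a given lag, which is the quantity an ACF plot shows for each lag. The function name `autocorrelation` and the toy series are illustrative assumptions; the partial autocorrelation is not computed here, and in practice a statistics library would normally produce both plots.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (lag 0 is always 1.0)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    avg = sum(series) / n
    denominator = sum((y - avg) ** 2 for y in series)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    numerator = sum(
        (series[t] - avg) * (series[t - lag] - avg) for t in range(lag, n)
    )
    return numerator / denominator


# Made-up series that repeats every 2 steps.
series = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0]

for lag in range(3):
    print(lag, autocorrelation(series, lag))
```

For a series with a repeating pattern, the autocorrelation is strongly positive at lags that match the period, which is why a yearly cycle in monthly data tends to show up as a spike at lag 12.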

Overall solution design

  • We'll need to go through the iterative steps of model development to be certain of the solution design, but, most likely:
    • We will likely build an ARIMA model that takes a range of dates as an input
    • We'll need to account for seasonality (for example, peak usage of Natural Gas in Summer)
    • We may need to consider the potential for an acceleration in the use of alternative/renewable resources that are NOT reflected in this dataset. This would serve as an offset to Natural Gas, as well as to all other fossil-based resources. This may be difficult to estimate, but we could try to add a "bias" or "what-if" component to the forecast

Measures of success

  • We'll want to measure success on both quantitative and qualitative aspects.

Quantitative Criteria

  • We will want to compare our 12-month model forecasts against the hold-out data.
  • We will do this by evaluating the RMSE.

Qualitative Criteria

  • We will want to ensure that we can explain the model in an intuitive manner and that the client can interact with it in a useful manner
  • We will want to propose some interactivity/what-if scenario capability in the notebook or in a complementary application that helps the client obtain deeper and iterative insights.
  • Generate energy policy recommendations based on the insights of the model

Jose Medina - Milestone 2 BEGINS HERE

  1. Note to evaluator of Milestone 2: it appears many of the exercises Milestone 2 recommends are things I have already completed in Milestone 1.
  2. To make it easier for the grader, I will simply repeat the code blocks that are relevant for Milestone 2.

</font>